Lillehammer
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Gu, Ken, Zhang, Zhihan, Lin, Kate, Zhang, Yuwei, Paruchuri, Akshay, Yu, Hong, Kazemi, Mehran, Ayush, Kumar, Heydari, A. Ali, Xu, Maxwell A., Narayanswamy, Girish, Liu, Yun, Poh, Ming-Zher, Yang, Yuzhe, Malhotra, Mark, Patel, Shwetak, Palangi, Hamid, Xu, Xuhai, McDuff, Daniel, Althoff, Tim, Liu, Xin
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as table size increases. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
- Asia > Middle East > UAE (0.14)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- Europe > United Kingdom > Wales (0.04)
- (43 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
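The RADAR abstract above describes simulating data artifacts through programmatic perturbations of clean tables. The following is a minimal sketch of that idea, not the benchmark's actual framework: the function names, the toy step-count table, and the crude median-ratio outlier rule are all assumptions made for illustration.

```python
import copy
import statistics


def inject_missing(rows, column, row_indices):
    """Return a copy of the table with `column` blanked (None) in the given rows."""
    perturbed = copy.deepcopy(rows)
    for i in row_indices:
        perturbed[i][column] = None
    return perturbed


def inject_outlier(rows, column, row_index, factor=100.0):
    """Return a copy of the table with one cell scaled by `factor` (hypothetical rule)."""
    perturbed = copy.deepcopy(rows)
    perturbed[row_index][column] = perturbed[row_index][column] * factor
    return perturbed


def naive_mean(rows, column):
    """Artifact-unaware baseline: missing cells count as 0, outliers are kept."""
    return sum((r[column] or 0.0) for r in rows) / len(rows)


def artifact_aware_mean(rows, column, max_ratio=10.0):
    """Skip missing cells and drop values far above the median in magnitude.

    The median-ratio filter is a deliberately crude assumption; it is only
    meant for columns with positive, similarly scaled values.
    """
    values = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(values)
    kept = [v for v in values if abs(v) <= max_ratio * abs(median)]
    if not kept:  # degenerate case, fall back to the median itself
        return median
    return sum(kept) / len(kept)


# Toy table (invented values): daily step counts.
clean_table = [
    {"day": 1, "steps": 8000.0},
    {"day": 2, "steps": 9000.0},
    {"day": 3, "steps": 10000.0},
    {"day": 4, "steps": 9000.0},
]

# Apply two perturbations: a missing value on day 1 and an outlier on day 3.
perturbed_table = inject_outlier(
    inject_missing(clean_table, "steps", [0]), "steps", 2
)

if __name__ == "__main__":
    print("naive mean:", naive_mean(perturbed_table, "steps"))
    print("artifact-aware mean:", artifact_aware_mean(perturbed_table, "steps"))
```

Because each perturbation returns a fresh copy, the clean table stays available as the reference against which an answer on the perturbed table can be compared.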
HistoryBankQA: Multilingual Temporal Question Answering on Historical Events
Mandal, Biswadip, Khandelwal, Anant, Gupta, Manish
Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts to benchmark the temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected, GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.
- Leisure & Entertainment > Sports (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Media (0.68)
- Law (0.67)
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach
Coleman, Chad, Neuman, W. Russell, Dasdan, Ali, Ali, Safinah, Shah, Manan
As large language models (LLMs) are increasingly deployed in consequential decision-making contexts, systematically assessing their ethical reasoning capabilities becomes a critical imperative. This paper introduces the Priorities in Reasoning and Intrinsic Moral Evaluation (PRIME) framework -- a comprehensive methodology for analyzing moral priorities across foundational ethical dimensions including consequentialist-deontological reasoning, moral foundations theory, and Kohlberg's developmental stages. We apply this framework to six leading LLMs through a dual-protocol approach combining direct questioning and response analysis to established ethical dilemmas. Our analysis reveals striking patterns of convergence: all evaluated models demonstrate strong prioritization of care/harm and fairness/cheating foundations while consistently underweighting authority, loyalty, and sanctity dimensions. Through detailed examination of confidence metrics, response reluctance patterns, and reasoning consistency, we establish that contemporary LLMs (1) produce decisive ethical judgments, (2) demonstrate notable cross-model alignment in moral decision-making, and (3) generally correspond with empirically established human moral preferences. This research contributes a scalable, extensible methodology for ethical benchmarking while highlighting both the promising capabilities and systematic limitations in current AI moral reasoning architectures -- insights critical for responsible development as these systems assume increasingly significant societal roles. The rapid evolution of generative large language models (LLMs) has brought the alignment issue to the forefront of AI ethics discussions -- specifically, whether these models are appropriately aligned with human values (Bostrom, 2014; Tegmark, 2017; Russell, 2019; Kosinski, 2024).
As these powerful models are increasingly integrated into decision-making processes across various societal domains (Salazar, A., & Kunc, M., 2025), understanding whether and how their operational logic aligns with fundamental human values becomes not just an academic question, but a critical societal imperative. In this paper we will present an analytical framework and findings to address the first two questions, and a preliminary exploratory analysis of the third. We will make the case that the answers to these questions are: yes, yes and yes. There are caveats and exceptions, of course, but the broad pattern, we believe, is clear. Our methodology permits us to explore not just what choices they make, but the reasoning chain of thought that leads to those decisions.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
Auditing the Ethical Logic of Generative AI Models
Neuman, W. Russell, Coleman, Chad, Dasdan, Ali, Ali, Safinah, Shah, Manan
As generative AI models become increasingly integrated into high-stakes domains, the need for robust methods to evaluate their ethical reasoning grows more pressing. This paper introduces a five-dimensional audit model -- assessing Analytic Quality, Breadth of Ethical Considerations, Depth of Explanation, Consistency, and Decisiveness -- to evaluate the ethical logic of leading large language models (LLMs). Drawing on traditions from applied ethics and higher-order thinking, we present a multi-battery prompt approach, including novel ethical dilemmas, to probe the models' reasoning across diverse contexts. We benchmark seven major LLMs, finding that while models generally converge on ethical decisions, they vary in explanatory rigor and moral prioritization. Chain-of-Thought prompting and reasoning-optimized models significantly enhance performance on our audit metrics. This study introduces a scalable methodology for ethical benchmarking of AI systems and highlights the potential for AI to complement human moral reasoning in complex decision-making contexts.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > Minnesota (0.04)
- (5 more...)
- Law (1.00)
- Government (1.00)
- Education (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.84)
Analyzing the Ethical Logic of Six Large Language Models
Neuman, W. Russell, Coleman, Chad, Shah, Manan
This study examines the ethical reasoning of six prominent generative large language models: OpenAI GPT-4o, Meta LLaMA 3.1, Perplexity, Anthropic Claude 3.5 Sonnet, Google Gemini, and Mistral 7B. The research explores how these models articulate and apply ethical logic, particularly in response to moral dilemmas such as the Trolley Problem and the Heinz Dilemma. Departing from traditional alignment studies, the study adopts an explainability-transparency framework, prompting models to explain their ethical reasoning. This approach is analyzed through three established ethical typologies: the consequentialist-deontological analytic, Moral Foundations Theory, and the Kohlberg Stages of Moral Development Model. Findings reveal that LLMs exhibit largely convergent ethical logic, marked by a rationalist, consequentialist emphasis, with decisions often prioritizing harm minimization and fairness. Despite similarities in pre-training and model architecture, a mixture of nuanced and significant differences in ethical reasoning emerges across models, reflecting variations in fine-tuning and post-training processes. The models consistently display erudition, caution, and self-awareness, presenting ethical reasoning akin to a graduate-level discourse in moral philosophy. With striking uniformity, these systems all describe their ethical reasoning as more sophisticated than what is characteristic of typical human moral logic.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Norway > Eastern Norway > Innlandet > Lillehammer (0.04)
- (8 more...)
- Personal (1.00)
- Research Report > Experimental Study (0.48)
- Law (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.67)
PersonalSum: A User-Subjective Guided Personalized Summarization Dataset for Large Language Models
Zhang, Lemei, Liu, Peng, Henriksboe, Marcus Tiedemann Oekland, Lauvrak, Even W., Gulla, Jon Atle, Ramampiaro, Heri
With the rapid advancement of Natural Language Processing in recent years, numerous studies have shown that generic summaries generated by Large Language Models (LLMs) can sometimes surpass those annotated by experts, such as journalists, according to human evaluations. However, there is limited research on whether these generic summaries meet the individual needs of ordinary people. The biggest obstacle is the lack of human-annotated datasets from the general public. Existing work on personalized summarization often relies on pseudo datasets created from generic summarization datasets or controllable tasks that focus on specific named entities or other aspects, such as the length and specificity of generated summaries, collected from hypothetical tasks without the annotators' initiative. To bridge this gap, we propose a high-quality, personalized, manually annotated abstractive summarization dataset called PersonalSum. This dataset is the first to investigate whether the focus of public readers differs from the generic summaries generated by LLMs. It includes user profiles, personalized summaries accompanied by source sentences from given articles, and machine-generated generic summaries along with their sources. We investigate several personal signals -- entities/topics, plot, and structure of articles -- that may affect the generation of personalized summaries using LLMs in a few-shot in-context learning scenario. Our preliminary results and analysis indicate that entities/topics are merely one of the key factors that impact the diverse preferences of users, and personalized summarization remains a significant challenge for existing LLMs.
Truncated Kernel Stochastic Gradient Descent on Spheres
Inspired by the structure of spherical harmonics, we propose the truncated kernel stochastic gradient descent (T-kernel SGD) algorithm with a least-square loss function for spherical data fitting. T-kernel SGD employs a "truncation" operation, enabling the application of series-based kernel functions in stochastic gradient descent, thereby avoiding the difficulties of finding suitable closed-form kernel functions in high-dimensional spaces. In contrast to traditional kernel SGD, T-kernel SGD is more effective in balancing bias and variance by dynamically adjusting the hypothesis space during iterations. The most significant advantage of the proposed algorithm is that it can achieve theoretically optimal convergence rates using a constant step size (independent of the sample size) while overcoming the inherent saturation problem of kernel SGD. Additionally, we leverage the structure of spherical polynomials to derive an equivalent T-kernel SGD, significantly reducing storage and computational costs compared to kernel SGD. Typically, T-kernel SGD requires only $\mathcal{O}(n^{1+\frac{d}{d-1}\epsilon})$ computational complexity and $\mathcal{O}(n^{\frac{d}{d-1}\epsilon})$ storage to achieve optimal rates for the $d$-dimensional sphere, where $0<\epsilon<\frac{1}{2}$ can be arbitrarily small if the optimal fitting or the underlying space possesses sufficient regularity. This regularity is determined by the smoothness parameter of the objective function and the decay rate of the eigenvalues of the integral operator associated with the kernel function, both of which reflect the difficulty of the estimation problem. Our main results quantitatively characterize how this prior information influences the convergence of T-kernel SGD. The numerical experiments further validate the theoretical findings presented in this paper.
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (5 more...)
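The T-kernel SGD abstract above describes running kernel SGD with a series-based kernel that is truncated at a level adjusted during the iterations, using a constant step size. The following is a rough sketch of that idea on the 2-sphere, not the paper's algorithm: the unit series weights, the truncation schedule `1 + int(t ** 0.25)`, the cap `max_level`, the step size, and the toy target are all assumptions chosen for illustration. It also stores every sample like plain kernel SGD, so it does not reproduce the storage savings the abstract reports.

```python
import math
import random


def legendre_values(max_degree, t):
    """Return [P_0(t), ..., P_max_degree(t)] via the three-term recurrence."""
    values = [1.0, t]
    for k in range(1, max_degree):
        values.append(((2 * k + 1) * t * values[k] - k * values[k - 1]) / (k + 1))
    return values[: max_degree + 1]


def truncated_kernel(x, y, level):
    """Series kernel on S^2 cut off at degree `level`.

    K_L(x, y) = sum_{k=0}^{L} (2k + 1) P_k(<x, y>), with all series weights
    set to 1 (an assumption made for this sketch).
    """
    t = sum(a * b for a, b in zip(x, y))
    t = max(-1.0, min(1.0, t))  # guard against rounding outside [-1, 1]
    return sum((2 * k + 1) * p for k, p in enumerate(legendre_values(level, t)))


def random_unit_vector(rng):
    """Draw a point uniformly from the unit sphere in R^3."""
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def predict(model, x):
    """Evaluate f(x) = sum_i c_i * K_{L_i}(x_i, x)."""
    return sum(c * truncated_kernel(xi, x, level) for xi, c, level in model)


def t_kernel_sgd(samples, step_size=0.1, max_level=3):
    """One pass of least-squares kernel SGD with a growing truncation level."""
    model = []  # list of (sample point, coefficient, truncation level)
    for t, (x, y) in enumerate(samples, start=1):
        level = min(max_level, 1 + int(t ** 0.25))  # hypothetical schedule
        residual = predict(model, x) - y
        model.append((x, -step_size * residual, level))
    return model


def mean_squared_error(model, points, target):
    return sum((predict(model, x) - target(x)) ** 2 for x in points) / len(points)


if __name__ == "__main__":
    rng = random.Random(0)
    target = lambda x: x[2]  # toy noise-free target: the z-coordinate
    train = [(x, target(x)) for x in (random_unit_vector(rng) for _ in range(300))]
    held_out = [random_unit_vector(rng) for _ in range(50)]
    model = t_kernel_sgd(train)
    print("held-out MSE:", mean_squared_error(model, held_out, target))
```

With these choices the kernel satisfies K_L(x, x) = (L + 1)^2, so the step size 0.1 together with a cap of L = 3 keeps step_size * K_L(x, x) = 1.6 below 2, which is the usual stability condition for a least-squares SGD step.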
Text mining in education
Ferreira-Mello, R., Andre, M., Pinheiro, A., Costa, E., Romero, C.
The explosive growth of online education environments is generating a massive volume of data, especially in text format, from forums, chats, social networks, assessments, and essays, among others. This raises exciting challenges in how to mine text data to find useful knowledge for educational stakeholders. Despite the increasing number of educational applications of text mining published recently, we have not found any paper surveying them. To fill this gap, this work presents a systematic overview of the current status of the Educational Text Mining field. Our final goal is to answer three main research questions: Which text mining techniques are most used in educational environments? Which educational resources are most used? And what are the main applications or educational goals? Finally, we outline the conclusions and the most interesting future trends.
- South America > Brazil > Pernambuco (0.04)
- South America > Brazil > Alagoas (0.04)
- North America > United States > New York (0.04)
- (4 more...)
- Research Report (1.00)
- Overview (1.00)
- Instructional Material > Course Syllabus & Notes (0.93)
- Education > Educational Technology > Educational Software > Computer Based Training (1.00)
- Education > Educational Setting > Online (1.00)
- Education > Curriculum > Subject-Specific Education (1.00)
- Education > Assessment & Standards > Student Performance (0.93)